Advances in Applied Mathematics 2025

Unsupervised Clustering Method for Set-Valued Data and Its Application

DOI: 10.12677/aam.2025.141032, pp. 318-330

王旭, 马丽涛

Keywords: Set-Valued Data, Classification Problem, Wasserstein Distance, Optimal Transport, Water Quality Classification


Abstract:

Classification is a fundamental problem in data mining and machine learning. Most classification methods, however, handle only vector-valued samples and pay little attention to set-valued samples, which are common in practice. This paper proposes an unsupervised clustering algorithm based on the Wasserstein distance, called Wk-means. It uses the entropy-regularized optimal transport model to measure the distance between set-valued data points and combines this distance with the idea of clustering to obtain a clustering method for set-valued data. To verify its effectiveness, experiments are first conducted on several public data sets. The results confirm that Wk-means performs well on set-valued data with many samples, categories, and features, and statistical tests show that it differs significantly from the other algorithms compared. Wk-means is then applied to the Fuyang River water quality data set, where it classifies water quality categories more accurately and runs more efficiently than traditional clustering algorithms. Wk-means thus performs well in classifying set-valued water quality data and can provide valuable decision support for environmental monitoring and management.
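The two ingredients named in the abstract, an entropy-regularized optimal transport distance between sets and a k-means-style clustering loop, can be illustrated with a minimal sketch. This is not the authors' Wk-means: the abstract does not state how cluster centers are updated, so the sketch swaps in a medoid update (each center is an actual set from the data), uses a farthest-point initialization, assumes uniform weights and a squared Euclidean ground cost, and scales the regularization by the largest pairwise cost for numerical convenience. The names `sinkhorn_cost` and `cluster_sets` and all parameter defaults are hypothetical.

```python
import numpy as np


def sinkhorn_cost(X, Y, reg=0.1, n_iter=200):
    """Entropy-regularized OT cost between two point sets (rows are points).

    Assumptions of this sketch: uniform weights on the points, squared
    Euclidean ground cost, and regularization scaled by the largest cost.
    """
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    # Pairwise squared Euclidean distances between points of X and Y.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    scale = C.max()
    if scale == 0:
        return 0.0
    K = np.exp(-C / (reg * scale))  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):  # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return float((P * C).sum())


def cluster_sets(sets, k, n_rounds=10):
    """Cluster a list of point sets with a k-medoids-style loop."""
    n = len(sets)
    # Precompute the symmetric distance matrix; the diagonal is set to 0.
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = sinkhorn_cost(sets[i], sets[j])

    # Farthest-point initialization: start at set 0, then repeatedly add
    # the set farthest from the medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1)
        medoids.append(int(nearest.argmax()))
    medoids = np.array(medoids)

    for _ in range(n_rounds):
        # Assignment step: each set joins its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Update step: pick the member with the smallest total
                # distance to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids

    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids
```

Calling `cluster_sets(sets, k=2)` on a list of NumPy arrays, each of shape `(n_points, n_features)`, returns one cluster label per set together with the indices of the sets chosen as medoids. Precomputing the full distance matrix keeps the loop simple but costs one Sinkhorn solve per pair of sets, which may be impractical for large collections.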
