%0 Journal Article
%T Unsupervised Clustering Method for Set-Valued Data and Its Application
%A 王旭
%A 马丽涛
%J Advances in Applied Mathematics
%P 318-330
%@ 2324-8009
%D 2025
%I Hans Publishing
%R 10.12677/aam.2025.141032
%X Classification is one of the fundamental problems in data mining, machine learning, and related fields. However, most classification methods focus only on vector-valued samples and pay less attention to the set-valued data samples that are widespread in practice. This paper proposes an unsupervised clustering algorithm (Wk-means) based on the Wasserstein distance: an entropy-regularized optimal transport model measures the distance between set-valued data points, and this distance is combined with the idea of clustering to yield a Wk-means method applicable to set-valued data. To verify the effectiveness of the method, experiments are first conducted on several public data sets. The results confirm the excellent performance of Wk-means on set-valued data with many samples, categories, and features, and statistical tests show that Wk-means differs significantly from the other algorithms. Wk-means is then applied to the Fuyang River water quality data set. The results likewise show that, compared with traditional data clustering algorithms, Wk-means classifies water quality categories more accurately and runs more efficiently. The Wk-means algorithm proposed in this paper performs well in the classification task for set-valued water quality data and can provide valuable decision support for environmental monitoring and management.
%K Set-Valued Data
%K Classification Problem
%K Wasserstein Distance
%K Optimal Transport
%K Water Quality Classification
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=106424