- 2018
Parallel-Efficiency-Sensitive Selection of the Number of Data Blocks for Large-Scale SVMs
Abstract:
Selecting the number of data blocks is a fundamental model-selection problem in parallel/distributed machine learning: it directly affects both the generalization performance and the parallel efficiency of the learned model. Existing parallel/distributed methods typically choose the number of blocks empirically or simply set it to the number of processors, without an explicit selection criterion. In this paper, we propose a parallel-efficiency-sensitive criterion, with a generalization-theory guarantee, for choosing the number of data blocks; it improves the computational efficiency of parallel/distributed machine learning while retaining test accuracy. We first derive a generalization error upper bound as a function of the number of data blocks. Building on this bound, we present a block-number selection criterion that trades off generalization error against parallel efficiency. Finally, we implement large-scale Gaussian-kernel support vector machines (SVMs) in the random Fourier feature space under the alternating direction method of multipliers (ADMM) framework on a high-performance computing cluster, adopting the proposed criterion. Experimental results on several large-scale benchmark datasets show that the proposed criterion is effective and efficient for large-scale SVMs.
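The abstract does not include the implementation, but the random Fourier feature map it builds on (Rahimi–Recht features approximating the Gaussian kernel) can be sketched as follows. The function name, parameters, and seed are illustrative, not taken from the paper:

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """Map X of shape (n, d) into a D-dimensional random Fourier feature
    space whose inner products approximate the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Frequencies are drawn from the Fourier transform of the Gaussian
    # kernel, i.e. w ~ N(0, 2*gamma*I); phases are uniform on [0, 2*pi).
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    # z(x) = sqrt(2/D) * cos(x @ W + b), so z(x).z(y) ~= k(x, y).
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

In this feature space the kernel SVM reduces to a linear SVM, which is what makes a block-wise ADMM consensus formulation over data partitions tractable: each block solves a linear subproblem on its shard and the blocks agree on a shared weight vector via the ADMM consensus updates.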