Clustering is an important unsupervised classification method that divides data into groups based on a similarity metric. K-means has become an increasingly popular clustering method and is widely used across applications. Centroid initialization is a key step in K-means clustering. Three commonly used initialization strategies aim to improve its performance: random initialization, K-means++, and PCA-based initialization. In this paper, we design an experiment to evaluate these three strategies on the UCI ML hand-written digits dataset. The experimental results show that the three initialization strategies find almost identical cluster centroids and produce almost the same clustering results, but the PCA-based strategy significantly reduces running time and is faster than the other two strategies.
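As an illustration of the kind of comparison described above, the following is a minimal sketch and not the experimental code of this paper. It assumes scikit-learn, whose `load_digits` provides a copy of the UCI hand-written digits test set. It also assumes that PCA-based initialization means seeding the centroids with the top principal directions. The function name `compare_initializations`, the number of restarts, and the feature scaling are hypothetical choices made for this example.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler


def compare_initializations(seed=0):
    """Fit K-means on the digits data with three initialization strategies."""
    digits = load_digits()
    data = StandardScaler().fit_transform(digits.data)
    n_clusters = 10  # one cluster per digit class

    # Assumption: "PCA-based" seeding uses the top principal directions
    # as initial centroids. components_ has shape (n_clusters, n_features).
    pca = PCA(n_components=n_clusters).fit(data)

    estimators = {
        "random": KMeans(n_clusters=n_clusters, init="random",
                         n_init=4, random_state=seed),
        "k-means++": KMeans(n_clusters=n_clusters, init="k-means++",
                            n_init=4, random_state=seed),
        # Seeding is deterministic here, so a single run is enough.
        "PCA-based": KMeans(n_clusters=n_clusters, init=pca.components_,
                            n_init=1),
    }

    results = {}
    for name, estimator in estimators.items():
        start = time.perf_counter()
        estimator.fit(data)
        elapsed = time.perf_counter() - start  # excludes the PCA fit itself
        results[name] = {
            "seconds": elapsed,
            "inertia": estimator.inertia_,
            # Agreement between the clusters and the true digit labels.
            "ari": adjusted_rand_score(digits.target, estimator.labels_),
        }
    return results


if __name__ == "__main__":
    for name, stats in compare_initializations().items():
        print(f"{name:10s} time={stats['seconds']:.3f}s "
              f"inertia={stats['inertia']:.0f} ARI={stats['ari']:.3f}")
```

In a setup like this, part of any speed advantage of the PCA-based variant comes from needing only one run, whereas the randomized strategies are restarted several times. The timing above also leaves out the cost of fitting the PCA, which a stricter comparison might include.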