Clustering is an important unsupervised classification method which divides data into different groups based some similarity metrics. K-means becomes an increasing method for clustering and is widely used in different application. Centroid initialization strategy is the key step in K-means clustering. In general, K-means has three efficient initialization strategies to improve its performance i.e., Random, K-means++ and PCA-based K-means. In this paper, we design an experiment to evaluate these three strategies on UCI ML hand-written digits dataset. The experiment result shows that the three K-means initialization strategies find out almost identical cluster centroids, and they have almost the same results of clustering, but the PCA-based K-means strategy significantly improves running time, and is faster than the other two strategies.
References
[1]
Pfitzner, D., Leibbrandt, R. and Powers, D. (2009) Characterization and Evaluation of Similarity Measures for Pairs of Clusterings. Knowledge and Information Systems, 19, 361-394. https://doi.org/10.1007/s10115-008-0150-6
[2]
Kodinariya, T.M. (2014) Survey on Exiting Method for Selecting Initial Centroids in K-Means Clustering. International Journal of Engineering Development and Research, 2, 2865-2868.
[3]
Hamerly, G. and Elkan, C. (2002) Alternatives to the K-Means Algorithm that Find Better Clusterings. Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM), McLean, VA, 4-9 November 2002, 600-607. https://doi.org/10.1145/584792.584890
[4]
Celebi, M.E., Kingravi, H.A. and Vela, P.A. (2013) A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm. Expert Systems with Applications, 40, 200-210, arXiv:1209.1960.
https://doi.org/10.1016/j.eswa.2012.07.021
[5]
Bradley, P.S. and Fayyad, U.M. (1998) Refining Initial Points for K-Means Clustering. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, 24-27 July 1998, 91-99.
[6]
Vattani, A. (2011) K-Means Requires Exponentially Many Iterations Even in the Plane. Discrete and Computational Geometry, 45, 596-616.
https://doi.org/10.1007/s00454-011-9340-1
[7]
Arthur, D. and Vassilvitskii, S. (2007) K-Means++: The Advantages of Careful Seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, 1027-1035.
[8]
Ding, C. and He, X.F. (2004) K-Means Clustering via Principal Component Analysis. Proceedings of the Twenty-First International Conference on Machine Learning, Banff, Alberta, 4-8 July 2004, 29. https://doi.org/10.1145/1015330.1015408