Abstract:
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of the k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.

Abstract:
We study how to learn multiple dictionaries from a dataset, and approximate any data point by the sum of the codewords each chosen from the corresponding dictionary. Although theoretically low approximation errors can be achieved by the global solution, an effective solution has not been well studied in practice. To solve the problem, we propose a simple yet effective algorithm \textit{Group $K$-Means}. Specifically, we take each dictionary, or any two selected dictionaries, as a group of $K$-means cluster centers, and then deal with the approximation issue by minimizing the approximation errors. Besides, we propose a hierarchical initialization for such a non-convex problem. Experimental results well validate the effectiveness of the approach.

Abstract:
Product quantization-based approaches are effective to encode high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian $K$-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from different sub codebooks in each subspace, which are optimally generated with regards to the minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on the standard real-life datasets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search.

Abstract:
Clustering text documents is a fundamental task in modern data analysis, requiring approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents.This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). Performance of these solvers is investigated by means of a large scale benchmark experiment.

Abstract:
In this paper we provide a fully distributed implementation of the k-means clustering algorithm, intended for wireless sensor networks where each agent is endowed with a possibly high-dimensional observation (e.g., position, humidity, temperature, etc.) The proposed algorithm, by means of one-hop communication, partitions the agents into measure-dependent groups that have small in-group and large out-group "distances". Since the partitions may not have a relation with the topology of the network--members of the same clusters may not be spatially close--the algorithm is provided with a mechanism to compute the clusters'centroids even when the clusters are disconnected in several sub-clusters.The results of the proposed distributed algorithm coincide, in terms of minimization of the objective function, with the centralized k-means algorithm. Some numerical examples illustrate the capabilities of the proposed solution.

Abstract:
The k-means algorithm is a partitional clustering method. Over 60 years old, it has been successfully used for a variety of problems. The popularity of k-means is in large part a consequence of its simplicity and efficiency. In this paper we are inspired by these appealing properties of k-means in the development of a clustering algorithm which accepts the notion of "positively" and "negatively" labelled data. The goal is to discover the cluster structure of both positive and negative data in a manner which allows for the discrimination between the two sets. The usefulness of this idea is demonstrated practically on the problem of face recognition, where the task of learning the scope of a person's appearance should be done in a manner which allows this face to be differentiated from others.

Abstract:
\textit{Clustering problems} often arise in the fields like data mining, machine learning etc. to group a collection of objects into similar groups with respect to a similarity (or dissimilarity) measure. Among the clustering problems, specifically \textit{$k$-means} clustering has got much attention from the researchers. Despite the fact that $k$-means is a very well studied problem its status in the plane is still an open problem. In particular, it is unknown whether it admits a PTAS in the plane. The best known approximation bound in polynomial time is $9+\eps$. In this paper, we consider the following variant of $k$-means. Given a set $C$ of points in $\mathcal{R}^d$ and a real $f > 0$, find a finite set $F$ of points in $\mathcal{R}^d$ that minimizes the quantity $f*|F|+\sum_{p\in C} \min_{q \in F} {||p-q||}^2$. For any fixed dimension $d$, we design a local search PTAS for this problem. We also give a "bi-criterion" local search algorithm for $k$-means which uses $(1+\eps)k$ centers and yields a solution whose cost is at most $(1+\eps)$ times the cost of an optimal $k$-means solution. The algorithm runs in polynomial time for any fixed dimension. The contribution of this paper is two fold. On the one hand, we are being able to handle the square of distances in an elegant manner, which yields near optimal approximation bound. This leads us towards a better understanding of the $k$-means problem. On the other hand, our analysis of local search might also be useful for other geometric problems. This is important considering that very little is known about the local search method for geometric approximation.

Abstract:
There are two broad approaches for differentially private data analysis. The interactive approach aims at developing customized differentially private algorithms for various data mining tasks. The non-interactive approach aims at developing differentially private algorithms that can output a synopsis of the input dataset, which can then be used to support various data mining tasks. In this paper we study the tradeoff of interactive vs. non-interactive approaches and propose a hybrid approach that combines interactive and non-interactive, using $k$-means clustering as an example. In the hybrid approach to differentially private $k$-means clustering, one first uses a non-interactive mechanism to publish a synopsis of the input dataset, then applies the standard $k$-means clustering algorithm to learn $k$ cluster centroids, and finally uses an interactive approach to further improve these cluster centroids. We analyze the error behavior of both non-interactive and interactive approaches and use such analysis to decide how to allocate privacy budget between the non-interactive step and the interactive step. Results from extensive experiments support our analysis and demonstrate the effectiveness of our approach.

Abstract:
In this note, we introduce a new algorithm to deal with finite dimensional clustering with errors in variables. The design of this algorithm is based on recent theoretical advances (see Loustau (2013a,b)) in statistical learning with errors in variables. As the previous mentioned papers, the algorithm mixes different tools from the inverse problem literature and the machine learning community. Coarsely, it is based on a two-step procedure: (1) a deconvolution step to deal with noisy inputs and (2) Newton's iterations as the popular k-means.

Abstract:
This paper discusses the topic of dimensionality reduction for $k$-means clustering. We prove that any set of $n$ points in $d$ dimensions (rows in a matrix $A \in \RR^{n \times d}$) can be projected into $t = \Omega(k / \eps^2)$ dimensions, for any $\eps \in (0,1/3)$, in $O(n d \lceil \eps^{-2} k/ \log(d) \rceil )$ time, such that with constant probability the optimal $k$-partition of the point set is preserved within a factor of $2+\eps$. The projection is done by post-multiplying $A$ with a $d \times t$ random matrix $R$ having entries $+1/\sqrt{t}$ or $-1/\sqrt{t}$ with equal probability. A numerical implementation of our technique and experiments on a large face images dataset verify the speed and the accuracy of our theoretical results.