Thursday, December 1, 2011

Power iteration clustering

I attended today a very interesting talk give by Prof. William Cohen from CMU (Given at HUJI). The talk describes a very simple spectral clustering method called power iteration clustering. A paper at the same name by Frank Lio and William Cohen appeared in last year's ICML 2010.

What I like about the method, is its simplicity and application to really large datasets. While some of the HUJI audience regarded the presentation quite skeptically since no theoretical results where presented to justify the great empirical performance, it seems that empirically the power iteration clustering method works just like spectral clustering. I guess that future theoretical results will soon to follow. And let me tell you a secret: some of the most successful methods deployed today in practice are simple methods like linear and (sparse) logistic regression, matrix factorization, K-means clustering etc.

In traditional spectral clustering, a square matrix describing the problem is constructed. (Typically the matrix is normalized s.t. the sum of each row equals to one). Then the K top eigenvectors  are computed. For example, if this is a social network then this matrix is the adjacency matrix. When the data is big, we can view this process of eigen decomposition as a dimensionality reduction step from sparse data in high dimension to a lower representation which is typically dense. The second step in spectral clustering is application of some clustering method like K-means to the eigenvectors. It was shown that typically data points which are close to each other has similar values in their eigenvectors.
That is why it makes sense to cluster those points together. 

Note: I am cheating a bit by not providing all the details. A great tutorial for spectral clustering is found here:

In contrast, the proposed power iteration clustering (PIC) computes power iteration using the data matrix. Since the sum of each row is equal to one, it can be thought of a weighted average calculation, where each node computes the average of its neighbors. Of course this process will lead to the same global average in all graph nodes. The main observation in the PIC algorithm, is that the process should be stopped early before every node in the graph has an equivalent sum.  When stopped at the right moment, each node store a cluster distinctive value. Intuitively, this is because nodes which are highly connected in a cluster push their intermediate results to be close to each other. Then the second step of K-means clustering is computed. This step is easier (relative to traditional spectral clustering), since we cluster just scalars in a single vector and not K dimensional vectors.

In terms of work, instead of computing K largest eigenvetors in the traditional spectral clsutering methods, PIC computes only a single eigenvector (which is actually a linear combination of several eigenvectors). So it is K times faster then traditional spectral clustering. 

Here is the algorithm:

The image above shows a demonstration of the power iteration method (taken from their ICML 2010 paper). (a) Shows a problem that K-means clustering has a difficulty operating on. (b) - (d) evolution of the power iteration method. Y-axis is the output scalar value of the output vector. The x-axis represent the different points. The colors are the output of the K-means clustering of scalars. While (b)-(c) are still not accurate, eventually the values of the vector converge very nicely to three clusters.
In principle,  if we would have continued to step (e) where all nodes have converged, we will get the same value in all nodes that would not have been useful.

Anyway, since the conjugate gradient, Lanczos and SVD methods implemented
in GraphLab (As well as K-meams) it is going to be very easy to implement this algorithm in GraphLab. I am planning to do it soon.


  1. based on my experiments, the performance on wine, usps and sona data sets are not good.

  2. Hi!
    This is the answer i got from Frank Lin: In general, PIC should work well where spectral clustering (spec. NCut) works well. With more clusters (perhaps usps) more than one pic vectors should be used.

    One caveat using these methods in general is how to do the similarity embedding, so that's my guess; but I'd like to try them and see why they don't work so well, if indeed they don't.

  3. Did you ever get this implemented?

    Sounds like it might also be a natural fit with pregel-like APIs e.g. Giraph. Sound plaisible?