Friday, September 9, 2011

GraphLab Clustering library

Recently I have been working on implementing a clustering library on top of GraphLab.
Currently we have K-means, Fuzzy K-means and LDA (Latent Dirichlet Allocation) implemented. I took some time for comparing performance of GraphLab vs. Mahout on an Amazon EC2 machine.

Here is a graph which compares performance:

Some explanation about the experiment. I took a subset of Netflix data with 3,298,163 movie ratings, 95,526 users, and 3,561 movies. The goal is to cluster user with similar movie preferences together. Both GraphLab and Mahout run on Amazon m2.xlarge instance .
This machine has 2 cores. I have used the following settings: 250 clusters, 50 clusters and 20 clusters. The algorithm runs a single iteration and then dumps the output into a text file.
For Mahout, I used Mahout's K-Means implementation. GraphLab was run using a single node, while Mahout was run using either one or two nodes. Mahout is using 7 mappers and GraphLab 7 threads.

Overall, GraphLab runs between x15 to x40 faster on this dataset.

A second experiment I did is to compare Mahout's LDA performance to GraphLab's LDA.
Here is the Graph:
For this experiment, I used m1.xlarge instance. I tested Graphlab on 4 cores, Mahout and 4 cores and Mahout on 8 cores (2 nodes). I used the same Netflix data subset, this time with 10 clusters. Graph depicts running time of a single iteration.

Finally, here are performance results of GraphLab LDA with 1, 2, 3 and 4 cores (on m1.xlarge EC2 instance):

Running time in this case is for 5 iterations.


  1. My pleasure! Keren from LexisNexis company is now testing GraphLab vs. their ML library. Stay tuned, I will post some performance results soon.



  2. Can i know how you checked the performance of mahout clustering?

    1. Those performance numbers are rather outdated. I would suggest looking at intel labs report:

      An additional resource comparing performance is here: