Friday, March 23, 2012

Large scale machine learning benchmark

About a week ago Joey and I had an interesting talk with Nicholas Kolegraff. Nicholas suggested a very useful initiative: to compile an EC2 Linux distribution with several machine learning frameworks installed, so one can very easily assess their performance on a set of benchmark problems. Nicholas has a preliminary demo, where one can log into the configured Linux system via his website and run multiple benchmarks using precompiled scripts. Using his system, it is easier for people to evaluate different ML frameworks without the inevitable learning and setup curve of using each system independently. Since we really liked his idea, we invited Nicholas to give a demo at our upcoming GraphLab Bay Area workshop.

Currently, Nicholas includes the following systems in his distribution: GraphLab, Mahout, MADlib and Vowpal Wabbit. He also asked us to help him compile some interesting datasets on which to test the different systems.

Not long ago, during our Intel Labs-GraphLab collaboration, the Intel team also asked for help in creating some kind of baseline benchmark suite for comparing different machine learning methods and implementations. Additionally, Edin Muharemagic, an architect at HPCC Systems, asked for help in assembling several large datasets for evaluating their new SVD solver, part of their new LexisNexis ML library.

Since it seems that many people find the same basic idea useful, we should probably make an effort in this direction. For a start, I am looking into providing some benchmarks for large scale sparse SVD problems. So why not crowdsource this effort with the help of my readers?

As a first try, I contacted Joshua Vogelstein from Johns Hopkins, who kindly offered to donate neural data. I am now looking for additional researchers who are willing to donate large scale data for real SVD / spectral clustering tasks in different fields. Matrices should have between 100 million and 10 billion non-zeros. Please email me if you are interested in helping create a standard benchmarking suite!
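
To give a feel for the kind of task these datasets would be used for, here is a minimal sketch of a truncated sparse SVD in Python using SciPy. The matrix sizes and density below are my own toy assumptions, purely for illustration; the actual benchmark matrices would be orders of magnitude larger and would be run through the frameworks listed above rather than through SciPy.

```python
# Toy illustration of the benchmark task: truncated SVD of a sparse matrix.
# The dimensions and density here are made up for illustration only; real
# benchmark matrices would have 100M-10B non-zeros.
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Build a small random sparse matrix in CSR format (~500K non-zeros).
A = sp.random(100000, 50000, density=1e-4, format='csr', random_state=0)

# Compute the top 10 singular triplets.
U, s, Vt = svds(A, k=10)

# svds returns singular values in ascending order; reverse for readability.
print("Top singular values:", s[::-1])
```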

[An update]
I just got a note from Andrew Olney of the University of Memphis that he is contributing a Wikipedia term occurrence matrix to the benchmark collection. I also got some Wikipedia term occurrence matrices from Jamie Callan, Brian Murphy, and Partha Talukdar of CMU. Thanks everyone!

The datasets have been added to our GraphLab datasets page here.
