Tuesday, March 5, 2013

Intel Labs report on GraphLab vs. Mahout

I have some very interesting news to report: I got the following graph from Nezih Yigitbasi of Intel Labs.

[Figure: runtime of Mahout vs. GraphLab for ALS matrix factorization on the Netflix data; GraphLab is roughly 20x faster.]
It compares Mahout vs. Distributed GraphLab on the popular task of matrix factorization with the ALS (alternating least squares) algorithm on the Netflix data. The bottom line is that GraphLab is about 20x faster than Mahout.

And here is the exact experiment setup I got from Nezih:

  • N is the number of ALS iterations, D is the number of latent factors. The experiments were conducted on a 16-node cluster.
  • We start GraphLab as: mpirun -hostfile ~/hostfile -x CLASSPATH ./als --ncpus=16 --matrix hdfs://host001:19000/user/netflix --D=$LATENT_FACTOR_COUNT --max_iter=$ITER_COUNT --lambda=0.065 --minval=0 --maxval=5
  • To run Mahout ALS, we use the factorize-netflix.sh script under the examples directory. It should be run as: ./factorize-netflix.sh /path/to/training_set/ /path/to/qualifying.txt /path/to/judging.txt
  • Our test cluster has 16 machines, each with 64GB of memory, 2 CPUs (Intel(R) Xeon(R) E5-2670 @ 2.60GHz, 8 cores each), and 4 x 1TB HDDs. The machines communicate over a 10Gb Ethernet interconnect.
  • The Netflix dataset was split into 32 equally sized chunks and then put into HDFS (a sketch of this step follows below).
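For completeness, the split-and-load step could be done roughly like this (a minimal sketch only: the ratings file name is a placeholder, and I am assuming GNU coreutils and the HDFS path from the GraphLab command above):

    # Split the ~1.5GB ratings file into 32 equally sized, line-aligned chunks
    # (-n l/32 keeps lines intact; -d gives numeric suffixes).
    split -n l/32 -d netflix_ratings.txt chunk_

    # Load the chunks into HDFS where the GraphLab job reads them.
    hadoop fs -mkdir /user/netflix
    hadoop fs -put chunk_* /user/netflix/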

16 comments:

  1. Which version of Mahout?

    Mahout trunk has made substantial improvements in speed lately.

    The asynchronous style of GraphLab can still have substantial benefits, but it isn't as bad as this looks.

    In the area of other algorithms such as clustering, Mahout has made improvements of multiple orders of magnitude and may now be considerably faster than GraphLab.

    One other important point is the matter of goals. Mahout's primary goal is practical scalability. As such, running under a dominant scaled paradigm such as map-reduce is critical. This imposes substantial penalties in some algorithms such as ALS.

  2. What do you mean by "dominant scaled paradigm"?

  3. I see lots of problems with this benchmark, unfortunately.

    First, the Netflix dataset is about 1.5GB in text format. Splitting it into 32 chunks means that each chunk (~48MB) is smaller than the default 64MB block size used by Hadoop. This means that you will run only 32 mappers on a cluster with 16 * 16 = 256 cores. ALS is a CPU-intensive problem, and this setup uses only 1/8th of the available computing power (see the quick check after this comment).

    Second, you should use the latest version of ALS in Mahout, which is able to run multithreaded, and configure it optimally (e.g., enable JVM reuse for mappers).

    Third, the example bash script used for Mahout includes a command-line program to preprocess the data and starts a second job that computes the RMSE of the factorization. Neither of these has anything to do with the performance of ALS.

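    A quick way to verify the block-count claim is to ask HDFS directly (a sketch; the path is the one from the setup above):

      # fsck reports "Total blocks"; with the default 64MB block size,
      # a ~1.5GB input occupies about 24 blocks.
      hadoop fsck /user/netflix -files -blocks | grep "Total blocks"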
  4. To expand on Sebastian's comments, the trunk version of ALS is lately about 2-3 times faster, due to a nearly 10x improvement in QR decomposition speed. If you combine that with the fact that only 1/8th of the available cores were used, you get an estimate that the speed of Mahout is somewhere between 16/22 and 24/22 relative to GraphLab (a 2-3x software speedup times the 8x of unused cores gives a 16-24x correction against the roughly 22x gap measured here). That is, not much of a detectable difference.

    If you then include the fact that the pre-processing and error-estimation phases of Mahout should be removed from the timings, it may well be that Mahout dominates GraphLab on speed.

    In fact, of course, with a flawed benchmark the real result is that you know nothing new.

  5. Hey guys,
    Some clarifications here.
    - Data preprocessing and RMSE calculation are NOT included in the timings; I only timed the ALS computation.
    - The dataset is split into 32 chunks ONLY for GraphLab, not for Mahout.
    - I will definitely have a look at Mahout trunk to see how much it improves the results.

    Nezih

    Replies
    1. If you didn't split the dataset for Mahout, it's even worse. The dataset will live in 24 HDFS blocks, so you will be using only 24 mappers, less than 1/10th of the cluster's computing power...

    2. Can you recommend the optimal number of splits to maximize performance? (Does it depend on the number of machines and cores? If so, please provide some guidelines.)

    3. In order to get optimal performance, you should use the trunk of Mahout and set the block size so that one split per machine is created.

      Then you should use one mapper per machine, as the latest version of Mahout uses multithreaded mappers (-Dmapred.tasktracker.map.tasks.maximum=1), and set --numThreadsForSolver to the number of cores on the machine (or one less).

    4. Update: I missed one thing. You also have to adjust the number of reducers, as the number of mappers run for the factorization depends on the number of reducers used in the preprocessing stage. The number of mappers you get is then not the number of blocks, but the number of reducers. So for Mahout's trunk, set the number of reducers equal to the number of machines. (A combined sketch of these settings follows below.)

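      Putting these suggestions together, a tuned run might look roughly like the sketch below. This is only illustrative: the paths, feature count, and iteration count are placeholders, the flag names are taken from the replies above (check them against your Mahout version), and mapred.tasktracker.map.tasks.maximum may need to be set in mapred-site.xml on the tasktrackers rather than per job:

        # Preprocessing: run with -Dmapred.reduce.tasks=16 (one reducer per
        # machine), so the factorization later gets 16 mappers (see the update above).

        # Factorization: one multithreaded mapper per machine, JVM reuse enabled.
        bin/mahout parallelALS \
          -Dmapred.tasktracker.map.tasks.maximum=1 \
          -Dmapred.job.reuse.jvm.num.tasks=-1 \
          --input hdfs:///user/netflix/ratings \
          --output hdfs:///user/netflix/als-output \
          --numFeatures 20 \
          --numIterations 10 \
          --lambda 0.065 \
          --numThreadsForSolver 16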
  6. Hi guys,
    I am very excited to have this discussion; the reason I posted this on my blog is that I hoped for exactly such comments. Intel Labs is definitely objective and would love to get any tips from you about tuning the algorithm's performance; we would love to give Mahout as fair a chance as possible.

  7. Hi, Danny!

    Does the ALS-WR method based on Y. Koren et al.'s paper "Collaborative Filtering for Implicit Feedback Datasets" differ from the WALS presented here? I'm asking because GraphChi needs two parameters, rating AND weight, while Mahout's implementation of ALS-WR requires only the confidence ratings (r_ij according to the article). Probably I missed something, but I want to compare the results with implicit feedback and no numeric ratings.

    Replies
    1. Apologies for the slow reply.
      My guess is that Koren's paper is concerned with implicit datasets, meaning binary events (click / no click), while the GraphChi implementation is more general, since it allows both binary and weighted ratings. If your data is implicit, just put a rating of 1 when creating the GraphChi input (see the sketch after this reply).

      Best,

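      For illustration, converting an implicit click log into GraphChi input might look like this minimal sketch (it assumes a whitespace-separated "user item" log, GraphChi's matrix-market input convention, and placeholder counts):

        # Matrix-market header: number of users, items, and observed clicks
        # (replace the $NUM_* placeholders with the counts for your data),
        # followed by one "user item 1" line per click.
        { echo "%%MatrixMarket matrix coordinate real general"
          echo "$NUM_USERS $NUM_ITEMS $NUM_CLICKS"
          awk '{ print $1, $2, 1 }' clicks.txt
        } > implicit_ratings.mm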
    2. Thank you, that was exactly the clarification I wanted to hear! But isn't there an unwanted mix of notions here? Does it make sense to model a user's confidence when gathering explicit ratings, as in "the user is confident in rating A, but not very sure of rating B for the same item"? Or does the factorization method not care about the "physical meaning" of the numbers?

      Sorry if you find my questions unrelated to the GraphChi/Mahout discussion.

    3. Yes, definitely. Uncertainty adds an additional dimension to the input.
      For example, we know that Obama was born in the US with 99.99% certainty (not 100%, since some people still claim he was born in Kenya).
      But for some other people we are not sure (maybe because several people share the same name, some information is missing, or some records were lost), so we can only say they were born in the US with, say, 80% certainty.

  8. Any updates from Intel Labs using an optimised Mahout setup as suggested by Sebastian?
