Monday, December 12, 2011

MPI vs. Hadoop

Here is an interesting blog post by John Langford (Yahoo! Research NY) about some of the pros and cons of MPI vs. Hadoop. John discusses the different approaches to parallel machine learning and summarizes them as follows:
The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within its scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:
  • For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  • For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  • For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
  • The model needs to fit into memory. If that’s not the case, then other approaches are required.
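
To make the counting point in the first bullet concrete, here is a toy word-count sketch in plain Python that mimics the map, shuffle, and reduce phases. It only illustrates the programming model; a real Hadoop job would express the same logic as Mapper/Reducer classes (or a streaming script) and let the framework handle partitioning and scale. The function names below are my own, not any Hadoop API.

from collections import defaultdict

# Toy word count: the canonical example of why MapReduce is
# "straightforwardly excellent for counting". The three phases below
# simulate what the framework does across a cluster.

def map_phase(documents):
    """Emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    docs = [
        "MPI vs Hadoop",
        "Hadoop is excellent for counting",
        "counting counting counting",
    ]
    print(reduce_phase(shuffle(map_phase(docs))))
    # e.g. {'hadoop': 2, 'counting': 4, ...}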

Anyone who reads this blog and is attending the NIPS Big Learning workshop is encouraged to contact me, since I plan to be there! Note that the asynchronous approach mentioned in bullet 3 actually refers to our GraphLab project. I liked the phrasing: "it's scary in practice". However, we also support BSP (bulk synchronous parallel) execution, so it is always possible to debug that way.
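
To illustrate what BSP-style execution buys you in terms of debugging, here is a minimal, framework-agnostic sketch of a synchronous update loop (a toy PageRank): every superstep reads a frozen snapshot of the previous values, so a run is deterministic and any superstep can be logged or replayed. This is not the GraphLab API; the names and structure are purely illustrative.

# BSP-style (synchronous) iteration: each superstep computes all new
# values from the previous snapshot only, with an implicit barrier
# between supersteps. Unlike an asynchronous schedule, the result is
# reproducible, which is what makes debugging tractable.

def bsp_pagerank(graph, num_supersteps=20, damping=0.85):
    """graph: dict mapping each vertex to its list of out-neighbors."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(num_supersteps):
        # Superstep: distribute rank along out-edges using old values.
        incoming = {v: 0.0 for v in graph}
        for v, neighbors in graph.items():
            if neighbors:
                share = ranks[v] / len(neighbors)
                for u in neighbors:
                    incoming[u] += share
        ranks = {v: (1 - damping) / n + damping * incoming[v] for v in graph}
        # Barrier here in a real system; state is consistent and loggable.
    return ranks

if __name__ == "__main__":
    g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(bsp_pagerank(g))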
