Saturday, March 1, 2014

Mahout or Oryx? Hadoop based analytics front is heating up

I got this from my colleague Eric Wolfe: a Gigaom blog post that backs Oryx, an open source project by Sean Owen, a former Mahout contributor who crossed the lines and is now creating a new system.

Apache Mahout, the traditional avenue for building machine learning models in Hadoop, “has reached the end of its road,” Owen said. It’s stuck in a batch-only first-generation MapReduce era, and it requires a lot of work on users’ parts to get a working system in place.

A heated discussion was recorded a couple of months ago. For example, Sebastian Schelter, one of the main Mahout contributors, did not stay idle:
..., I also cannot understand why Cloudera and you need to start a new open source project that in many ways mirrors what mahout offers. Why not contribute the algorithm implementations (the computation layer) to mahout and build the serving layer as a project on top of that? I don't see what would have prevented this, I would think it would have been warmly welcomed by this community.

It is not that this new project creates competition from which users will benefit, it's exactly the opposite. To me it feels like an intentional abandonment of mahout. Instead of giving users a single project where we could have united efforts, users now have to choose between two things that in general do the same things with each of them missing some functionality. In my eyes, users lose here.


It's a very bad day for mahout today.

One of the reasons behind this controversy is that Mahout is backed by MapR, which is backed by EMC. Oryx, on the other hand, is backed by Cloudera. MapR and Cloudera offer competing Hadoop distributions.

Another interesting note in the Gigaom article concerns Spark:
Owen is spending a lot of time contributing to the Apache Spark project because he plans to rewrite Oryx to make Spark the primary processing framework instead of MapReduce. “There’s actually a lot of reasons to be interested in Spark from a machine learning point of view,” he said. “… I’d much rather put my energies there.”
He’s not alone. As we have explained, Spark is becoming a popular choice for next-generation big data applications and companies such as Cloudera and Hortonworks are embracing it as a big part of Hadoop’s future. 

1 comment:

  1. Indeed, these are surprisingly 'heated'. I do think the article yesterday over-emphasized, in its paraphrases, a negative tone towards Mahout, which wasn't my intent.

    I find some comments you cite above hard to understand, from people who should know better. I remain the single biggest contributor to Mahout, by myself. For people who have done less for the project to suggest I am 'damaging' the project seems hypocritical.

    The project has technical and community dysfunction. I think many would agree, and because I end up being the only messenger, I take the arrows. This is part of the reason why a different project was created; that, and, the design goals are just quite different.

    It's tempting to view this through a lens of simple vendor sparring. Indeed on the thread you cite, MapR's Ted showed he feels OK making Apache lists a vendor battleground, using it to accuse me of being "paid off" and conspiring to undermine Apache. Go read it!

    I don't agree with the implication floated there and sort of reflected here, that I (or Cloudera) somehow selectively back or "sabotage" projects just for commercial advantage. It's simple to verify that we have supported all of the things mentioned on this page, and more than any other vendor. Mahout is in no way a "MapR" project, even if that is indeed the only place I see MapR contributing anything.

    Hadoop is changing so fast, and it's great. It is no great loss if projects come and go as platforms and needs change. I don't feel bad about continuing to direct contribution to where I think it's most valuable. I am, to be honest, really bewildered by the slings and arrows here.