Wednesday, October 28, 2015

News from Stockholm: HopsWorks

A couple of weeks ago I visited Stockholm, where I was kindly invited to give a keynote at the SICS Data Science Day. Thanks again to Prof. Seif Haridi for his kind invitation!

One of the interesting lectures there was by Prof. Jim Dowling, previously of MySQL. Jim is building an open source system called HopsWorks to improve the Hadoop experience. Jim has kindly answered my questions about the project.

When did the project start?
Hops started in 2011, when we began with HDFS. In 2013, we started on YARN. In 2014, we started HopsWorks, so it has been about 14 months in development now.

What is the project goal?
The project goal is to make Hadoop for humans. We want to make it easier for people to store, process, and share data in Hadoop. That means moving away from the command line to graphical interfaces for non-programming tasks. Everything from managing access to data to sharing datasets should be accessible to people who are not data engineers.

What is the project license?
The project is licensed as a mix of Apache v2 and GPL v2. Our extensions to Hadoop are Apache v2 licensed, but we have connectors to the NewSQL database that we support (NDB - MySQL Cluster), and they have to be licensed under GPL v2. Because of the mixed licensing model, we don't provide a single distribution. However, users can install HopsWorks with 6 mouse clicks using our recommended installation tool,

Who is using HopsWorks?
We have had the most interest from companies and organizations with sensitive datasets that need to share those datasets with users in a controlled manner. For example, Ericsson is interested in enabling non-Ericsson employees to do analysis on some of its datasets without requiring the long process of signing NDAs. As HopsWorks has a data scientist role (who cannot import or export data from the system), they could provide access to external data scientists, knowing they have an audit trail for actions by the external users and that the external users cannot download the dataset or derived data from the cluster. In the area of genomics, we have a lot of interest as well.

Can you share performance numbers?
I don't have figures for Sentry's performance. The figures I showed were for state-of-the-art policy enforcement points (XACML). Sentry is trying not to do any enforcement and is basically sending all of its rules to all of the services to be cached there (HDFS, Solr, Impala, etc.). My guess is that Sentry itself can still only handle a few hundred ops/sec. The main problem it has is how to keep the privileges and the data consistent. I don't see how they can do that for all Hadoop services.
Here's Cloudera's own report on the slowdown of turning on Sentry for Solr (it leads to a 20% slowdown for Solr - even with most privileges being stored in Solr):
They admit that Sentry "doesn't scale to store document-level [privileges]", so they store policies in Solr instead, breaking the assumption that Sentry is the central store for all policies (privileges).

Can you share a video of your talk?
The talk from last week is up:

And here are some screenshots:
