Tuesday, November 29, 2011

Spotlight: Dmitriy Golovashkin - Oracle and the R project

A few days ago I got the following note from Dmitriy, a principal staff member at Oracle:

Hi Danny,

I found out about GraphLab just two days ago.

I was working on a MapReduce based QR factorization and whilst searching web for references, found your blog & GraphLab. No question, I am planning to learn more about the project. Looks very exciting!

In general, our group is focusing on in-database and Hadoop based
data mining and statistical algorithms development.
R http://www.r-project.org/ is a big part of it.

Kind regards,

As always, I am absolutely thrilled to get feedback from my readers! I asked Dmitriy if he could share some more insight about the R project, and here is what he wrote:

R is huge in data mining and statistical camps.
The number of contributed packages is staggering; it is among the most complete and feature-rich environments for statistical and data mining computing.
Another very important observation concerns the quality of some of the contributed packages: outstanding work & implementation.

The biggest problem with R has to do with its inherent data storage model: everything must be stored in memory and most algorithms are sequential.
For instance, the notion of a matrix in R is captured in the following C one-liner:
double* a = (double*) malloc(sizeof(double) * nElements);

It is possible to build R with vendor-supplied matrix packages (BLAS and LAPACK) and thus have multithreaded matrix computations in R (which helps a lot).
However if the input does not fit into memory, then it is somewhat problematic to run even the simplest algorithms.

We enable R folks to carry out computations directly on the database data (no need to move the data out). The in-memory limitation has been lifted for some algorithms (not all of course).

More is here

Kind regards,

We definitely agree that R is a very useful statistical package. In fact, one of our users, Steve Lianoglou from Weill Cornell Medical College, ported our Shotgun solver package to R. Here is an excerpt from Steve's homepage which summarizes his ambivalent relationship with R:

I have a love/hate relationship with R. I appreciate its power, but at times I feel like I'm "all thumbs" trying to use it to bend the computer to my will (update: I actually don't feel like this anymore. In fact I'm quite comfortable in R now, but try not to get too frustrated if you're not yet ... it only took me about 6 months or so!).
If that feeling sounds familiar to you, these references might be useful.
  • The R Inferno [PDF]. "If you are using R and you think you're in hell, this is a map for you." I stumbled on this document after using R for about 8 months or so, and I could still sympathize with that statement.
  • An R & Bioconductor Manual by Thomas Girke

Anyway, if you are a reader of this blog or a GraphLab user - send me a note!


  1. Hi Danny.
    Recently I found your PhD thesis about solving systems of linear equations using GaBP, and then GraphLab.
    I have two questions: what's the advantage of GaBP over algorithms like sparseLM for solving linear systems of equations?
    And is it possible to incrementally solve the "Ax=b" equation as new measurements are added to the matrix "A" and vector "b"?

    Thanks a lot

  2. Hi!
    My answer is here: http://bickson.blogspot.com/2011/08/quiet-rise-of-gaussian-belief.html

  3. Hah!

    I guess you never know when someone will quote your own ramblings.

    For the record: I do (now) rather enjoy programming in R. Maybe it's a Stockholm syndrome-like effect, but ... isn't it always that way with whatever language you begin to settle into? :-)

    I've largely left my "awkward R feelings" on my tips/tricks page so n00bs don't get discouraged too quickly.

    Perhaps of interest: I also started toying with wrapping GraphLab itself in an R package in my free time, but haven't looked at it much recently -- I almost got the `gcluster` stuff to work, so ... I will revisit it again soon.

    And a half-baked thought: there might be some clever tricks we can use to give R at least an (external) handle on bigger in-memory-matrix-like objects (by holding an external pointer to a C++ object, say) that shoot data back and forth to a GraphLab backend of sorts ... and there was also an int64 package released a short while ago that might prove handy for working with 64-bit ints (duh), and therefore for indexing such huge external objects if necessary (among other things).

    Anyway ... interesting times ahead!


  4. Would love to be updated on your GraphLab-R progress. And let us know if there is anything we can do to help!