Thursday, January 31, 2013

(Relatively) new algorithm for parallel solution of a linear system

I got this interesting link from Igor Carron, the master of compressed sensing. Here is a quote from his blog post:

In this note, following suggestions by Tao, we extend the randomized algorithm for linear equations over prime fields by Raghavendra to a randomized algorithm for linear equations over the reals. We also show that the algorithm can be parallelized to solve a system of linear equations $A x = b$ with a regular $n \times n$ matrix $A$ in time $O(n^2)$, with probability one. Note that we do not assume that $A$ is symmetric.

==> Looks like a cool and simple algorithm to implement in parallel for solving linear system. If anyone implemented it in parallel let me know! Also if you have some matlab code that would be interesting.

When digging slightly dipper I saw the following claim in the paper:
So it seems like an O(n^3) operations after all. (The O(n^2) is achieved on each of the n cpus).

Signal reconstruction

I got this hilarious video from my friend Dror Baron who is drinking too much coffee at Starbucks in North Carolina State University. It seems that caffeine levels are finally starting to get him.

Spotlight: Asterix - UC Irvine

I got this interesting lecture from my boss Prof. Carlos Guestrin. He attended an interesting talk by Michael Carey from UC Irvine about their system Asterix.

Asterix is a big data management system with an open source Apache license. Asterix is based on the hyracks data parallel platform. On top of it there are multiple services: map reduce service, pregel service, and even hive and pig functionality.

Here is a performance graph showing pregelx performance relative to Hadoop:

You can read more about Asterix here.

Saturday, January 12, 2013

Graph Database Resources

I got this from my collaborator Joey Gonzalez:
A paper that summarizes the state of graph databases that might be worth reading:
A nice paper describing how databases systems are built.  In particular it talks about the isolation of storage and computation dependencies in a database:
Regarding actual performance of databases for Graphs, I got an interesting link from my collaborator Yucheng Low:
I found an interesting benchmark comparing MySQL NDB against Memcached you may be interested in.
Summary: MySQL NDB faster than Memcached.
Really only faster if the entire NDB table can fit in memory (and disk write flushes are disabled). If HDD IO is necessary, it slows down quite a lot.  Of course, MySQL sharding+replication can be used to keep things running instead of going to disk.

Additional interesting resource I got from my collaborator Aapo Kyrola, regarding Twitter's FlockDB implementation which implements a graph database in twitter:
The blog post by the Twitter engineering team discusses in quite a lot of detail how they extract so much performance from MySQL, worth a read:  
Our goals were: 
  • Write the simplest possible thing that could work. 
  • Use off-the-shelf MySQL as the storage engine, because we understand its behavior — in normal use as well as under extreme load and unusual failure conditions. 
  • Give it enough memory to keep everything in cache. 
  • Allow for horizontal partitioning so we can add more database hardware as the corpus grows. 
  • Allow write operations to arrive out of order or be processed more than once. (Allow failures to result in redundant work rather than lost work.) FlockDB was the result. 

I got from Carlos Guestrin an Overview of SQL vs. no-SQL data stores.

Friday, January 4, 2013

Machine learning tools for biomedical data?

I got this nice email today from Xinghua Lou, a computational data scientist from Sloan-Kattering/Cornell Medical:

Hi Danny,
My name is Xinghua Lou and I follow your blog and watched your GraphLab demo at NIPS (particularly the interactive topic model demo, cool).
I am working for Sloan-Kettering/Cornell Med and we need machine learning for mining massive biomedical datasets (genomics, medical records, etc). I am making a survey of large scale machine learning toolboxes/frameworks and hope to have some suggestion from you.
I know your GraphLab, along with other toolboxes such as Wabbit from Yahoo, Shogun from MPI Germany, and recently Google's large deep networks - feels like the beginning of Skynet :). I am wondering if I have missed any other important tools. If so, please let me know.
Thanks a lot for your consideration. Wish you a nice weekend.
Best regards,Xinghua Lou

Addionally, a couple of weeks ago I had a very interesting conversation with a great guy (with the longest title I ever found in LinkedIn!)  Alon Ben Ari, Director of Regional anesthesia VA Puget Sound Health Care system at University of Washington at VA medical Center Seattle. VA hospital are also looking at large scale data analytics for finding the right ML tools to deploy.

My guess is that there are probably many other medical institutions who are right now looking for tools for large scale data analytics. Why don't we setup (a voluntary) task force to look identify common biomedical problems, existing machine learning tools, and what are the futures tools that are missing? Anyone who is interested in welcome to contact me.

Here is a list of some useful ML tools:
Definitely VW is a great tool for regression/classification. Liblinear is also a great too. Our GraphLab subproject GraphChi is gaining a lot of popularity recently.

And here are two additional tools, thanks to my mega collaborator JustinYan: sofia-ml is a great tool for combined regression and ranking. svm-perf is a good SVM solver that can directly minimize ROC and precision/recall measures.

Interesting dataset: million songs dataset

As you probably all know we are always looking for additional free, high quality datasets to try some of our techniques on. I got the million songs dataset link from Clive Cox, Chief Scientist at Rammble Labs, our man in London.

Here is some information from their website:

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital, using code we provide.
The Million Song Dataset is also a cluster of complementary datasets contributed by the community:

Here is information on getting the dataset. Kaggle managed a contest for rating music items drawn from this dataset. For evaluating performance they used MAP@500 metric described here. Anyway I am soon going to try out our GraphChi CF toolbox on this dataset. Keep posted for some results!

An update: as promised, here are some GraphChi runtime results deployed on the million songs dataset and instructions how to reproduce them.