Large Scale Machine Learning and Other Animals: A new dog on the block: GraphChi

Monday, July 16, 2012

A new dog on the block: GraphChi

Anyone who attended our GraphLab workshop could not have missed the new GraphChi release.
It is a new project by Aapo Kyorla from CMU, with some surprisingly great performance results.
The basic idea is that computation is limited to a weak machine (mac mini) with a large hard drive or SSD drive. Instead of loading the full problem into memory, the problem is read in parts from disk.

It is hard to believe what kind of results Aapo extracts from a mac mini. For example:

Application	Input graph	Graph size	Comparison	GraphChi on Mac Mini (SSD)	Ref
Pagerank - 3 iterations	twitter-2010	1.5B edges	Spark, 50 machines, 8.1 min	13 min	1
Pagerank - 100 iterations	uk-union	3.8B edges	Stanford GPS (Pregel), 30 machines, 144 min	581 min	2
Web-graph Belief Propagation (1 iter.)	yahoo-web	6.7B edges	Pegasus, 100 machines, 22 min	27 min	3
Matrix factorization (ALS), 10 iters	Netflix	99M edges	GraphLab, 8-core machine, 4.7 min	9.8 min	4
Triangle counting	twitter-2010	1.5B edges	Hadoop, 1636 machines, 423 mins	55 min	5

Namely, one mac mini does the work of 1636 Hadoop machines, x8 faster!!!

If you don't believe it, you are welcome to download GraphChi. The software is open source with Apache license. It is written in C++ but now I hear Aapo is created a Java version as well.

Additional reading: MIT Tech Review.