Large Scale Machine Learning and Other Animals: July 2012

Tuesday, July 17, 2012

Cray XMT

Cray XMT (Xtreme Multithreaded Technology) is new hardware designed with graph algorithms in mind. Those of you who attended our GraphLab workshop may have seen the poster by YarcData, a Cray company, about their data analytics contest.

Today I got some more details from Venkat Krishnamurthy about the XMT platform. XMT contains special-purpose CPUs. Each CPU runs 128 threads in parallel, and a system scales up to 4096 CPUs. The machine has one huge shared memory infrastructure (holding up to 512TB of data). The machine is specially designed for non-uniform memory access, with no CPU cache. This allows for very efficient graph analytics.

Some of the problems we are struggling to solve in distributed GraphLab are answered by this architecture. For example:

Balanced graph partitioning across machines - is not needed.
Dynamic and asynchronous methods are supported.
Tolerates long global latencies while continuing the computation locally.

Here are some impressive numbers from their presentation:

The drawback of the XMT machine is its price - I hear it is not cheap. For further details contact YarcData.

Additional material I got from Venkat:

Here's more information from the Cray web site

http://www.cray.com/Assets/PDF/products/xmt/CrayXMTBrochure.pdf

http://www.cray.com/Products/XMT/Product/Resources.aspx

Also here's tons of information from Sandia's own MTGL site with several detailed presentations on XMT and the MTGL library on several key graph algorithms

https://software.sandia.gov/trac/mtgl/wiki/MtglPresentations

and a presentation by David Mizell of the engineering team

http://wwwjp.cray.com/downloads/XMT-Presentation.pdf

YarcData is building a semantic database on top of the Cray XMT machine.

Monday, July 16, 2012

A new dog on the block: GraphChi

Anyone who attended our GraphLab workshop could not have missed the new GraphChi release.
It is a new project by Aapo Kyrola from CMU, with some surprisingly great performance results.
The basic idea is to run the computation on a single modest machine (a Mac Mini) with a large hard drive or SSD. Instead of loading the full problem into memory, the problem is read from disk in parts.
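To make the "read in parts from disk" idea concrete, here is a minimal sketch in Python. It is NOT GraphChi's actual algorithm or API; it only illustrates the general out-of-core pattern of streaming an edge file in fixed-size chunks while running PageRank. The binary file format, the function names and the chunk size are all my own assumptions for illustration, and the sketch assumes the per-vertex arrays still fit in memory.

```python
import os
import struct
import tempfile

# Hypothetical on-disk format: each edge is (source, destination) as two uint32.
EDGE = struct.Struct("<II")


def write_edges(path, edges):
    """Write the edge list to disk in the hypothetical binary format."""
    with open(path, "wb") as f:
        for src, dst in edges:
            f.write(EDGE.pack(src, dst))


def read_edge_chunks(path, chunk_edges):
    """Yield lists of edges, holding at most chunk_edges edges in memory."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(EDGE.size * chunk_edges)
            if not buf:
                break
            yield [EDGE.unpack_from(buf, i) for i in range(0, len(buf), EDGE.size)]


def pagerank_out_of_core(path, num_vertices, iterations=3, damping=0.85, chunk_edges=2):
    """PageRank where the edges are only ever streamed from disk.

    Only the per-vertex arrays live in memory. Rank mass of vertices
    without out-edges is simply dropped in this simplified version.
    """
    # First pass over the file: count out-degrees.
    out_degree = [0] * num_vertices
    for chunk in read_edge_chunks(path, chunk_edges):
        for src, _ in chunk:
            out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        # One full pass over the edge file per iteration, chunk by chunk.
        for chunk in read_edge_chunks(path, chunk_edges):
            for src, dst in chunk:
                incoming[dst] += rank[src] / out_degree[src]
        rank = [(1.0 - damping) / num_vertices + damping * s for s in incoming]
    return rank


def run_demo():
    """Run on a tiny 3-vertex cycle (0 -> 1 -> 2 -> 0) stored in a temp file."""
    edges = [(0, 1), (1, 2), (2, 0)]
    fd, path = tempfile.mkstemp(suffix=".edges")
    os.close(fd)
    try:
        write_edges(path, edges)
        return pagerank_out_of_core(path, num_vertices=3, iterations=5, chunk_edges=2)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(run_demo())
```

On the symmetric 3-cycle every vertex should end up with the same rank. The point of the sketch is only that memory use is bounded by the chunk size plus the vertex arrays, not by the number of edges.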

It is hard to believe what kind of results Aapo extracts from a Mac Mini. For example:

Application	Input graph	Graph size	Comparison system and time	GraphChi on Mac Mini (SSD)	Ref
Pagerank - 3 iterations	twitter-2010	1.5B edges	Spark, 50 machines, 8.1 min	13 min	1
Pagerank - 100 iterations	uk-union	3.8B edges	Stanford GPS (Pregel), 30 machines, 144 min	581 min	2
Web-graph Belief Propagation (1 iter.)	yahoo-web	6.7B edges	Pegasus, 100 machines, 22 min	27 min	3
Matrix factorization (ALS), 10 iters	Netflix	99M edges	GraphLab, 8-core machine, 4.7 min	9.8 min	4
Triangle counting	twitter-2010	1.5B edges	Hadoop, 1636 machines, 423 min	55 min	5

Namely, for triangle counting, one Mac Mini does the work of 1636 Hadoop machines, and does it almost 8x faster (55 min vs. 423 min)!!!

If you don't believe it, you are welcome to download GraphChi. The software is open source under the Apache license. It is written in C++, but I hear Aapo has now created a Java version as well.
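For those who want to check the arithmetic, here is a small Python sketch that computes the time ratios from the table above. The numbers are copied from the table; the function and variable names are mine. A ratio above 1 means the single Mac Mini finished faster than the comparison system; a ratio below 1 means it was slower in wall-clock time while using far fewer machines.

```python
# (application, comparison system, machines, comparison minutes, GraphChi minutes)
# Values are copied from the table above.
RESULTS = [
    ("Pagerank - 3 iterations", "Spark", 50, 8.1, 13.0),
    ("Pagerank - 100 iterations", "Stanford GPS (Pregel)", 30, 144.0, 581.0),
    ("Web-graph Belief Propagation (1 iter.)", "Pegasus", 100, 22.0, 27.0),
    ("Matrix factorization (ALS), 10 iters", "GraphLab (8-core machine)", 1, 4.7, 9.8),
    ("Triangle counting", "Hadoop", 1636, 423.0, 55.0),
]


def time_ratio(comparison_min, graphchi_min):
    """Comparison wall-clock time divided by GraphChi wall-clock time."""
    return comparison_min / graphchi_min


def machine_minutes(machines, minutes):
    """Total machine time spent: number of machines times wall-clock minutes."""
    return machines * minutes


for app, system, machines, cmp_min, chi_min in RESULTS:
    print(
        f"{app}: {system} / GraphChi time ratio = {time_ratio(cmp_min, chi_min):.2f}, "
        f"{system} machine-minutes = {machine_minutes(machines, cmp_min):.0f}, "
        f"GraphChi machine-minutes = {machine_minutes(1, chi_min):.0f}"
    )
```

Only the triangle counting row has a ratio above 1 (about 7.7); in the other rows GraphChi is slower in wall-clock time but runs on a single machine.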

Additional reading: MIT Tech Review.

First Graphlab video tutorial by Nick - BigDataR Linux

I just got this cool video tutorial from Nick Kolegraff, BigDataR Linux distributor:

It explains how to run the collaborative filtering package, GraphLab v1 (multicore) on Amazon EC2. Thanks Nick for this great effort!

I will next help Nick to produce some newer tutorials of GraphLab v2 and GraphChi.

Sunday, July 15, 2012

Large Scale Machine Learning and Other Animals

This week we are celebrating 100,000 page views!
Thanks to YOU - my amazing collection of readers!
Next (ambitious) milestone: 1,000,000 page views.. :-)

Some more good news to report: we won 3rd place in ACM KDD CUP 2012, track 2 (out of 192 groups). We had a great team from the Chinese Academy of Sciences:

Xingxing Wang
Shijie Lin
Dongying Kong
Liheng Xu
Qiang Yan
Siwei Lai
Liang Wu
Guibo Zhu
Heng Gao
Yang Wu
Danny Bickson
Yuanfeng Du
Neng Gong
Chengchun Shu
Shuang Wang
Fei Tan
Jun Zhao
Yuanchun Zhou
Kang Liu

I had a VERY minor part in the team's success. Most of the credit goes to the great Chinese guys.
We are now submitting a paper called: "Click-Through Prediction for Sponsored Search Advertising with Hybrid Models". As soon as it is ready I will post it here.

You may recall that this is the 2nd year we have placed high in this competition.

Tuesday, July 10, 2012

The GraphLab workshop is over!!

Thanks so much to the 320+ people who attended - and to the great speakers - we have an amazing user community!

I cannot think of a better way to demonstrate the buzz and excitement about GraphLab than to list some of the tweets we had yesterday. Here is a link to Carlos' lecture.

Monday, July 2, 2012

Amazing ML visualization software

I stumbled upon this while viewing Nirod Priell's mailing archives. MLDemos is a beautiful
piece of ML visualization software which can be very useful when trying to understand the different ML methods.

I installed MLDemos easily using the Mac .dmg file. Here are some images I created
when playing with the software. There are two classes: red and white points.
Here is classification using kernelized SVM:

And the same problem solved using linear SVM:

And using sigmoids:

The disks are the support vectors.

The apparent drawback is that the visualization is limited to two dimensions. This software does not replace the need to study ML methods, but it gives a very nice visualization that
helps in understanding the different methods.
