Sunday, October 20, 2013

LSRS 2013 - A great success!

I just heard from LSRS 2013 co-organizers Tao Ye (Pandora) and Quan Yuan (Taobao) that LSRS (large scale recommender systems) workshop we organized as part of RecSys 2013 was a great success. More than 130 researchers attended - it was the workshop with the highest attendance.
Below are some images, with credit to Tao Ye:


There where two notable keynotes. The first from Justin Basilico from netflix. The second from Aapo Kyrola from UW about collaborative filtering with GraphChi. The full papers, along with other interesting talks are found here.

Thanks to Pandora for hosting the happy hour just after the workshop:




Stay tuned for next year's workshop - this time in SF!


Friday, October 18, 2013

Total subgraph communicability in GraphLab

Our user Jacob Kesinger has implemented total subgraph communicability in GraphLab v2.2.
By the way, Jacob is looking for a data scientist position in the area between San Jose and Palo Alto. So if you are looking for PhD in math, with a lot of applied knowledge, I truly recommend Jacob!
(And tell him I sent you.. :-) Anyway he is his description of the method:



I'm posting to the list to announce that I've implemented Total
Subgraph Communicability, a new centrality measure due to
Benzi&Klymco[0].   For a directed graph with adjacenty matrix A, 

TSC_i = sum_j exp(A)_{ij} = (exp(A)*1)_i.

This code calculates the TSC using an Arnoldi iteration on the Krylov
subspace {b, Ab,A*Ab, A*A*Ab, ...}  due to Saad[1], and using the new
warp engine from Graphlab 2.2 (without which this would have been, at
best, very challenging).

I don't have access to a cluster right now, so I can't test that it performs
properly in the distributed environment, but it works fine on a single
macbook air (and can compute TSC for a graph with 130K nodes and 1.4M
edges in less than a minute under load).  

Small components of large graphs will have bogus answers due to
floating point issues.  To find the exact TSC for a particular node i,
run with "--column i" to find exp(A)*e_i; you will have to sum the
resulting output yourself, however.


SAMPLE INPUT:
0 1
1 2
1 3
2 4
3 4
1 0
2 1
3 1
4 2
4 3
OUTPUT:
0 5.17784
1 10.3319
2 8.49789
3 8.49789
4 7.96807


You can verify this in python as:
import scipy
import scipy.linalg
A = scipy.array([[0,1,0,0,0],[1,0,1,1,0],[0,1,0,0,1],[0,1,0,0,1],[0,0,1,1,0]])
scipy.linalg.expm2(A).sum(axis=1)


[0]: Benzi, Michele, and Christine Klymko. Total Communicability as a Centrality Measure. ArXiv e-print, February 27, 2013.http://arxiv.org/abs/1302.6770.

[1]: Saad, Yousef. “Analysis of Some Krylov Subspace Approximations to the Matrix Exponential Operator.” SIAM Journal on Numerical Analysis 29, no. 1 (1992): 209–228.

Thanks Jacob for your great contribution!!

Wednesday, October 16, 2013

Monday, October 14, 2013

Friday, October 11, 2013

Kaggle Titanic Contest Tutorial

I found this great tutorial written by Andrew Conti, a statistician & consultant from NY. The tutorial is using ipython as an interactive way for teaching data scientists how to better understand the data and perform classification using various tools.

The topic is the Titanic contest I mentioned here earlier.

Here is a snapshot:
What is nice about this tutorial is that it is interactive, you can follow by running each step, changing parameters on the fly.

Wednesday, October 9, 2013

First Spark Summit

Databricks, the newly founded Spark based company has kindly asked us to share our event organizer, Courtney Burton, for their first Spark Summit.

The event will take place on Monday December 2, 2013 in SF.

Readers of my blog are welcome to use the following discount code.

Monday, October 7, 2013

New SOSP paper: a lightweight infrastructure for graph analytics

I got this reference from my collaborator Aapo Kyorla, author of GraphChi.

A Lightweight Infrastructure for Graph Analytics. Donald Nguyen, Andrew Lenharth, Keshav Pingali (University of Texas at Austin), to appear in SOSP 2013.

It is an interesting paper which heavily compares to GraphLab, PowerGraph (GraphLab v2.1) and
GraphChi.

One of the main claims is that dynamic and asynchronous scheduling can significantly speed up many graph algorithms (vs. bulk synchronous parallel model where all graph nodes are executed on each step).

Some concerns I have is regarding the focus on multicore settings, which makes everything much easier, and thus to comparison with PowerGraph less relevant.

Another relevant paper which improves on GraphLab is: Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC. Zhiyuan Lin, Duen Horng Chau, and U Kang, IEEE Big Data Workshop: Scalable Machine Learning: Theory and Applications, 2013. The basic idea is to speed graph loading using mmap() operation.