Large Scale Machine Learning and Other Animals: November 2013

Thursday, November 21, 2013

GraphLab Seattle Users Meetup - Video Online!

Thanks so much to Clive Boulton for his great help in organizing and video recording our event.
Here is the talk video. I will post the slides soon.

Wednesday, November 20, 2013

Big data research positions

I was contacted by Bosch, who are looking to expand their Palo Alto research center, headed by Soundar Srinivasan. They have 4 open positions:

Tuesday, November 12, 2013

PowerLyra

Today we got the following email from Rong Chen of Shanghai Jiao Tong University:

Hi, GraphLab Experts,

I'm from the IPADS group, Shanghai Jiao Tong University, China. This email is the first public disclosure of the PowerLyra project, a new hybrid graph analytics engine based on GraphLab 2.2 (PowerGraph).

As you can see, natural graphs with skewed degree distributions raise unique challenges for graph computation and partitioning. Existing graph analytics frameworks usually use a “one size fits all” design that processes all vertices uniformly, which results in suboptimal performance for natural graphs: they either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab), or incur high communication cost among vertices even for low-degree vertices (e.g., PowerGraph).

We argue that the skewed distribution in natural graphs calls for differentiated processing of high-degree and low-degree vertices. We therefore developed PowerLyra, a new graph analytics engine that embraces the best of both worlds of existing frameworks by dynamically applying different computation and partitioning strategies to different vertices. PowerLyra uses a Pregel/GraphLab-like computation model to process low-degree vertices, minimizing computation, communication and synchronization overhead, and a PowerGraph-like computation model to process high-degree vertices, reducing load imbalance and contention. To seamlessly support all PowerLyra applications, PowerLyra further introduces adaptive unidirectional graph communication.
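To make the idea of differentiated processing concrete, here is a toy sketch of a gather step that sums neighbor values. It is not PowerLyra code: the function names, the `threshold` parameter and the dictionary layout (partition id mapped to the list of in-neighbors stored there) are all assumptions made for illustration. It only shows why a low-degree vertex whose in-edges sit on one machine needs no cross-machine partial sums, while a high-degree vertex splits its gather into per-partition partial sums.

```python
def local_gather(sources, value):
    """Low-degree path: all in-edges live on the master's partition,
    so one local pass suffices and no partial sums cross machines."""
    return sum(value[s] for s in sources), 0


def distributed_gather(sources_by_part, value, master_part):
    """High-degree path: every partition computes a partial sum and each
    non-master partition sends one message to the master."""
    partials = {
        part: sum(value[s] for s in sources)
        for part, sources in sources_by_part.items()
    }
    messages = sum(1 for part in partials if part != master_part)
    return sum(partials.values()), messages


def gather(sources_by_part, value, master_part, threshold):
    """Pick the path by in-degree (threshold is a hypothetical tuning knob).

    Returns (gathered sum, number of cross-partition messages).
    """
    degree = sum(len(sources) for sources in sources_by_part.values())
    is_local = set(sources_by_part) <= {master_part}
    if degree <= threshold and is_local:
        return local_gather(sources_by_part.get(master_part, []), value)
    return distributed_gather(sources_by_part, value, master_part)
```

For example, with `value = {1: 1.0, 2: 2.0}`, calling `gather({0: [1, 2]}, value, master_part=0, threshold=2)` takes the local path, while a vertex whose in-neighbors are spread over several partitions takes the distributed one.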

PowerLyra additionally proposes a new hybrid graph-cut algorithm that embraces the best of both worlds of edge-cut and vertex-cut: it adopts edge-cut for low-degree vertices and vertex-cut for high-degree vertices. Theoretical analysis shows that the expected replication factor of random hybrid-cut is always better than that of both random vertex-cut and edge-cut. For skewed power-law graphs, empirical validation shows that random hybrid-cut also decreases the replication factor of the current default heuristic vertex-cut (Grid) from 5.76X to 3.59X and from 18.54X to 6.76X for synthetic graphs with power-law constants 2.2 and 1.8 respectively. We also developed a new distributed greedy heuristic hybrid-cut algorithm, named Ginger, inspired by Fennel (a greedy streaming edge-cut algorithm for a single machine). Compared to Grid vertex-cut, Ginger can reduce the replication factor by up to 2.92X (from 2.03X) and 3.11X (from 1.26X) for synthetic and real-world graphs respectively.
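A minimal sketch of a hybrid cut and of how a replication factor can be measured is shown below. It rests on assumptions the email does not spell out: vertices are integers, the degree that matters is the in-degree, low-degree vertices keep all their in-edges on the partition chosen by the target id, high-degree vertices spread their in-edges by source id, and a plain modulo stands in for a hash function. The function names and the `threshold` parameter are hypothetical; this is not the Ginger heuristic or PowerLyra's actual implementation.

```python
from collections import defaultdict


def in_degrees(edges):
    """Count in-edges per target vertex."""
    degree = defaultdict(int)
    for _, dst in edges:
        degree[dst] += 1
    return degree


def hybrid_cut(edges, num_parts, threshold):
    """Assign each directed edge (src, dst) to a partition.

    Low-degree targets: all in-edges go to the target's partition
    (edge-cut-like). High-degree targets: in-edges are spread by
    source id (vertex-cut-like).
    """
    degree = in_degrees(edges)
    placement = {}
    for src, dst in edges:
        if degree[dst] <= threshold:
            placement[(src, dst)] = dst % num_parts
        else:
            placement[(src, dst)] = src % num_parts
    return placement


def replication_factor(placement):
    """Average number of partitions in which a vertex appears."""
    parts_of = defaultdict(set)
    for (src, dst), part in placement.items():
        parts_of[src].add(part)
        parts_of[dst].add(part)
    return sum(len(p) for p in parts_of.values()) / len(parts_of)
```

On a small star-like graph, for instance a hub with many in-edges plus a few low-degree vertices, the hub's in-edges end up spread over all partitions while each low-degree vertex keeps its in-edges together.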

Finally, PowerLyra adopts a locality-conscious data layout optimization in the graph ingress phase to mitigate poor locality during vertex communication. We argue that a small increase in graph ingress time (less than 10% for power-law graphs and 5% for real-world graphs) is worthwhile for an often larger speedup in execution time (usually more than 10%, and notably 21% for the Twitter follow graph).

Right now, PowerLyra is implemented as an execution engine and graph partitioner for GraphLab, and can seamlessly support all GraphLab applications. A detailed evaluation on a 48-node cluster using three different graph algorithms (PageRank, Approximate Diameter and Connected Components) shows that PowerLyra outperforms the current synchronous engine with Grid partitioning of PowerGraph (Jul. 8, 2013. commit:fc3d6c6) by up to 5.53X (from 1.97X) and 3.26X (from 1.49X) for real-world (Twitter, UK-2005, Wiki, LiveJournal and WebGoogle) and synthetic (10-million-vertex power-law graphs with constants ranging from 1.8 to 2.2) graphs respectively, due to a significantly reduced replication factor, lower communication cost and improved load balance.

The website of PowerLyra: http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

The latest release has been ported to GraphLab 2.2 (Oct. 22, 2013. commit:e8022e6), and aims to provide the best compatibility with minimal changes to the framework (perhaps only adding a "type" field to vertex_record). However, this version does not yet include the locality-conscious graph layout optimisation. You can check out the branch from IPADS's gitlab server: git clone http://ipads.se.sjtu.edu.cn:1312/opensource/powerlyra.git

I have not had time to try it out yet, but it definitely looks like an interesting research direction.

Monday, November 11, 2013

Online Machine Learning Course by Alex Smola & Geoff Gordon (CMU)

Lecture videos should be online:
http://alex.smola.org/teaching/cmu2013-10-701x/index.html

The course is targeted at graduate students.

Wednesday, November 6, 2013

Hunch's Taste Graph

Again, I got this interesting link from my collaborator Chris DuBois: a blog post about the Hunch graph,
which is a nice example of how graph data can improve recommendations. It also got them bought by eBay a couple of years ago.

Notable presentation: Datapad @ Strata NY+ PyData

Another interesting presentation I got from my collaborator Chris DuBois.

If you don't know Wes McKinney, he is the creator of Python Pandas and the author of the great book "Python for Data Analysis". Do yourself a favor and buy this book. It is the #1 useful tool for any task involving data analytics.

Anyway, not long ago Wes founded a company called Datapad, which is doing something secretive regarding data analytics. Wes gave an interesting Strata talk, where he did not reveal anything about Datapad, but gave a good overview of companies in the data analytics preparation domain.

At PyData NY, he gave a more detailed talk about the shortcomings of Pandas.

And guess what? Wes just agreed to give a Datapad talk at our 3rd GraphLab conference!! I can't wait to learn more about Datapad.

Monday, November 4, 2013

The 3rd GraphLab Conference is coming!

We have just started to organize our 3rd user conference, to be held on Monday, July 21 in SF. This is a very preliminary notice to attract companies and universities who would like to be involved. We are planning a mega event this year with around 800-900 data scientists attending, on the topics of graph analytics and large scale machine learning.

The conference is a non-profit event held by GraphLab.org to promote applications of large scale graph analytics in industry. We invite talks from all major state-of-the-art solutions for graph processing, graph databases and large scale data analytics and machine learning. We are looking for sponsors who would like to contribute to the event organization.

Preliminary talks:

Reynold Xin, co-Founder of Databricks will present Spark
Wes McKinney, Founder & CEO of DataPad - TBA
Prof. Carlos Guestrin, Founder & CEO of GraphLab will present GraphLab
Prof. Vahab Mirrokni from Google's Pregel team - TBA
Prof. Joe Hellerstein, Founder & CEO of Trifacta - TBA
Tao Ye, Senior Scientist, Pandora Internet Radio - TBA
Josh Wills, Director of Data Science at Cloudera - TBA
We hope to get a talk from Dr. Avery Ching from Facebook about Giraph.

Preliminary program committee:

Prof. Joe Hellerstein, Founder & CEO Trifacta & Berkeley
Prof. Carlos Guestrin, CEO GraphLab & UW
Mr. Michael Draugelis, Chief Data Scientist, Lockheed Martin
Mr. Eric Bieschke, Chief Scientist & VP Playlist, Pandora Internet Radio
Mr. Abhijit Bose, VP Data Science, American Express
Mr. Richard Mallah, Director of Unstructured and Big Data Analytics, Cambridge Semantics
Mr. Steven Hillion, VP Product, Alpine Data Labs
Dr. Jim Kim, VP Product, Skytree
Prof. Josep Lluís Larriba Pey, Universitat Politècnica de Catalunya

Sponsors:

The second GRADES workshop, to be held on June 22, 2014 at the premier database systems conference ACM SIGMOD/PODS in Snowbird (Utah), attracts database systems architects, graph data management researchers and practitioners to describe and discuss scenarios, experiences and system internals encountered in managing and analyzing large quantities of graph-shaped data. The GRADES workshop is co-sponsoring the third GraphLab Conference.

Notable presentation: Mark Levy's Recsys talk

My collaborator Chris DuBois sent me Mark Levy's Recsys talk. Previously at last.fm, Mark has generously contributed an implementation of the CLiMF algorithm to the GraphChi collaborative filtering toolkit.

What's nice about this talk is that it examines some of the recent data competitions and points out some flaws in the way they were constructed.
